Classifying Very-High-Dimensional Data with Random Forests of Oblique Decision Trees
Authors
Thanh-Nghi Do: Institut Telecom; Telecom Bretagne, UMR CNRS 3192 Lab-STICC, Université européenne de Bretagne, France; Can Tho University, Vietnam
Philippe Lenca: Institut Telecom; Telecom Bretagne, UMR CNRS 3192 Lab-STICC, Université européenne de Bretagne, France
Stéphane Lallich: Université de Lyon, Laboratoire ERIC, Lyon 2, France
Nguyen-Khang Pham: IRISA, Rennes, France; Can Tho University, Vietnam
Abstract
The random forests method is one of the most successful ensemble methods. However, random forests do not perform well on very-high-dimensional data in the presence of dependencies. In this case one can expect many combinations of the variables to exist, and unfortunately the usual random forests method does not exploit this situation effectively. Here we investigate a new approach for supervised classification with a huge number of numerical attributes. We propose a random oblique decision trees method: it randomly chooses a subset of predictive attributes and uses an SVM as the split function over these attributes. On 25 datasets, we compare random forests of random oblique decision trees with SVMs and with random forests of C4.5, using classical measures (e.g. precision, recall, F1-measure and accuracy). Our proposal performs significantly better on very-high-dimensional datasets and slightly better on lower-dimensional datasets.
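The idea in the abstract (pick a random subset of attributes at each node, then split with an SVM over those attributes) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not say which SVM variant or training procedure is used, so a linear SVM fitted by simple subgradient descent on the hinge loss stands in for it; labels are assumed binary in {-1, +1}; bootstrap sampling and majority voting follow the usual random forests recipe; all function names, hyperparameters and the toy data are hypothetical.

```python
import random
from collections import Counter


def decision(x, feats, w, b):
    """Signed value of the oblique split w . x[feats] + b."""
    return sum(wj * x[f] for wj, f in zip(w, feats)) + b


def train_linear_svm(X, y, feats, rng, epochs=20, lam=0.01):
    """Stand-in linear SVM: subgradient descent on the regularised hinge loss,
    using only the attributes in `feats` (assumed trainer, not the paper's)."""
    w, b, t = [0.0] * len(feats), 0.0, 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 0.1 / (1.0 + 0.01 * t)          # decaying step size
            margin = y[i] * decision(X[i], feats, w, b)
            w = [wj * (1.0 - eta * lam) for wj in w]  # L2 shrinkage
            if margin < 1.0:                       # point violates the margin
                w = [wj + eta * y[i] * X[i][f] for wj, f in zip(w, feats)]
                b += eta * y[i]
    return w, b


def build_tree(X, y, n_feats, rng, depth=0, max_depth=6, min_size=5):
    """Grow one oblique tree: each internal node holds an SVM hyperplane
    over a freshly drawn random subset of attributes."""
    majority = Counter(y).most_common(1)[0][0]
    if depth >= max_depth or len(y) < min_size or len(set(y)) == 1:
        return {"leaf": majority}
    feats = rng.sample(range(len(X[0])), n_feats)
    w, b = train_linear_svm(X, y, feats, rng)
    side = [decision(x, feats, w, b) >= 0.0 for x in X]
    if all(side) or not any(side):                 # degenerate split
        return {"leaf": majority}
    children = {}
    for name, keep in (("left", False), ("right", True)):
        idx = [i for i, s in enumerate(side) if s == keep]
        children[name] = build_tree([X[i] for i in idx], [y[i] for i in idx],
                                    n_feats, rng, depth + 1, max_depth, min_size)
    return {"feats": feats, "w": w, "b": b, **children}


def predict_tree(node, x):
    while "leaf" not in node:
        go_right = decision(x, node["feats"], node["w"], node["b"]) >= 0.0
        node = node["right"] if go_right else node["left"]
    return node["leaf"]


def build_forest(X, y, n_trees, n_feats, rng):
    """Each tree is trained on a bootstrap sample of the training set."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 n_feats, rng))
    return forest


def predict_forest(forest, x):
    """Majority vote over trees (labels are -1 / +1)."""
    return 1 if sum(predict_tree(t, x) for t in forest) >= 0 else -1


def make_data(n, d, rng):
    """Hypothetical toy data: class +1 centred at +2, class -1 at -2."""
    y = [rng.choice((-1, 1)) for _ in range(n)]
    X = [[rng.gauss(2.0 * label, 1.0) for _ in range(d)] for label in y]
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(200, 8, rng)
X_test, y_test = make_data(100, 8, rng)
forest = build_forest(X_train, y_train, n_trees=11, n_feats=3, rng=rng)
hits = sum(predict_forest(forest, x) == t for x, t in zip(X_test, y_test))
accuracy = hits / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The toy data is deliberately easy so that the sketch runs quickly; it says nothing about the results reported in the paper. In practice one would likely replace `train_linear_svm` with a tested SVM solver and extend the vote to more than two classes.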
Similar resources
Hybrid weighted random forests for classifying very high-dimensional data
Random forests are a popular classification method based on an ensemble of a single type of decision trees from subspaces of data. In the literature, there are many different types of decision tree algorithms, including C4.5, CART, and CHAID. Each type of decision tree algorithm may capture different information and structure. This paper proposes a hybrid weighted random forest algorithm, simul...
On Oblique Random Forests
Abstract. In his original paper on random forests, Breiman proposed two different decision tree ensembles: one generated from “orthogonal” trees with thresholds on individual features in every split, and one from “oblique” trees separating the feature space by randomly oriented hyperplanes. In spite of a rising interest in the random forest framework, however, ensembles built from orthogonal tr...
Stratified sampling for feature subspace selection in random forests for high dimensional data
For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for rand...
Data Mining of High Accuracy for the Efficiency in the Task of Massive Printing
Random forests are known to be robust for missing and erroneous data as well as irrelevant features. Moreover, even though the forests have many trees, they can utilize the fast building property of decision trees, so they do not require much computing time. In this paper an efficient procedure that utilizes random forests to predict the cylinder bands in rotogravure printing is shown. Even tho...
REGRESSION LEAF FOREST: A FAST AND ACCURATE LEARNING METHOD FOR LARGE & HIGH DIMENSIONAL DATA SETS by SIVANESAN GANESAN
There are a number of learning methods that provide solutions to classification and regression problems, including Linear Regression, Decision Trees, KNN, and SVMs. These methods work well in many applications, but they are challenged for real world problems that are noisy, nonlinear or high dimensional. Furthermore, missing data (e.g., missing historical features of companies in stock data), i...